1. INTRODUCTION

Worldwide broadband distribution has increased the number of Internet users. With
faster connections, video hosting and sharing services are becoming popular among users [1]. The
availability of resources over the Internet and broadband connections enables the emergence of sophisticated
new platforms. YouTube is one well-known video content publishing platform with social
networking features, such as support for posting text comments that enable interaction between producers
(channel owners) and viewers [2].

Recently, YouTube has adopted a monetization system to reward producers, encouraging them to
produce high-quality original content and increase the number of views. Since this system was introduced,
the platform has been flooded with unwanted content, typically low-quality information known as spam. Spam is the
use of an electronic messaging system to send unsolicited messages, especially advertisements, as well as
repeated messages on the same website. Social spam can take many forms, including mass
messaging, cruelty, humiliation, hate speech, malicious links, fake reviews, fake hints, and
personal information [3].

This problem could become critical. It has led users to disable comments on their
videos because most of the comments are spam. To date, research on detecting spam in YouTube
comments using machine learning techniques is still lacking.

To address these problems, this paper proposes a technique for video sharing spam
comment feature detection. This work evaluates the performance of spam comment feature detection
using accuracy.


2. RELATED WORK 

Spam is associated with low-quality information and consists of undesired content [1]. Spam is usually
found in text, video and images [4-5]. Most spam is used to manipulate Internet users into giving up personal
information, for example through phishing and malware. Spam is also used for commercial advertising [1]. Spam
messages usually work by flooding the Internet with the same message in order to force users to receive it.
In addition, video spam is low-quality video content published on YouTube by malicious users [6].
Existing studies cover many kinds of spam, such as blog spam [7], web spam [8], Twitter
spam [9], email spam [10], YouTube spam [1] and SMS spam [11].

Ham is a message that is not spam, in other words a "non-spam" or "good" message [12]. It is
considered to carry high-quality information and meaningful words [13].

Figure 1 shows an example of spam and ham comments posted on YouTube.


[Figure 1 content: two YouTube comments. Spam: "Mr. Ga Ga Bu", 1 day ago, consisting only of the link http://rich-birds.cc/?i=2254076. Ham: "MexmaH", 2 months ago, comment text partly legible: "this music".]

Figure 1. Example of spam and ham comments posted on YouTube 








2.1. Pre-Processing 

In the pre-processing step, the features are first extracted. The subject field contains the data that
needs to be pre-processed [7]. This phase consists of a few steps: tokenization, stop-word
removal and lemmatization. These processes are performed to remove noise, redundancy and
common English words that would affect the detection phase [13]. Most existing research performs
pre-processing before continuing to the next step. Table 1 shows the spam detection steps and the
processes used by researchers.

Based on Table 1, the papers Tubespam: Comment Spam Filtering on YouTube and
Combating Comment Spam with Machine Learning Approaches use four techniques to detect spam:
Pre-processing, Features Extraction, Classification and Evaluation.

Feature extraction is the process of identifying features or types of information contained within the
documents. Only after these features are extracted can the machine learning algorithms find the target
concept descriptions of the categories.

The next paper, Towards Filtering of SMS spam messages using Machine Learning Based
Technique, uses six techniques, namely Pre-processing, Feature Selection, Classifier Training, Classifier
Testing, Classification Result and Performance evaluation.

Statistical Twitter Spam Detection Demystified-Performance, Stability and
Scalability uses four techniques: Data Collection, Feature Selection, Classification
and Evaluation.

KidsTube: Detection, Characterization and Analysis of Child Unsafe Content & Promoters
on YouTube uses only three techniques to detect spam: Data Collection, Features Selection
and Classification.

In Detecting Video Spammers in YouTube Social Media, the researchers
used four techniques to detect video spammers, namely Data Collection, Data Pre-processing, Feature
Construction and Classification.

Lastly, Data Mining Based spam Detection System for YouTube spam
uses three techniques to detect spam: Data Collection, Classification and Evaluation.

Table 1. Steps and Process for Spam Detection

Author  Title
        Detection Technique

[1]     Tubespam: Comment Spam Filtering on YouTube
        Pre-processing, Features Extraction, Classification, Evaluation

[4]     Combating Comment Spam with Machine Learning Approaches
        Pre-processing, Feature Extraction, Classification, Evaluation

[13]    Towards Filtering of SMS spam messages using Machine Learning Based Technique
        Pre-processing, Feature Selection, Classifier Training, Classifier Testing, Classification Result, Performance evaluation

[9]     Statistical Twitter Spam Detection Demystified-Performance, Stability and Scalability
        Data Collection, Feature Selection, Classification, Evaluation

[2]     KidsTube: Detection, Characterization and Analysis of Child Unsafe Content & Promoters on YouTube
        Data Collection, Features Selection, Classification

[14]    Detecting Video Spammers in YouTube Social Media
        Data Collection, Data Pre-processing, Feature Construction, Classification

[6]     Data Mining Based spam Detection System for YouTube spam
        Data Collection, Classification, Evaluation





2.2. Feature Selection 
Feature selection is also known as attribute selection, variable selection or variable subset selection. It is
the process of selecting a subset of variables for use in model construction. Feature selection techniques are used for
four reasons:
a. Simplification of models to make them easier for researchers to interpret.
b. Shorter training times.
c. Avoiding the curse of dimensionality.
d. Enhanced generalization by reducing overfitting.

Feature selection is a very important task in text spam filtering. Selected features should be
correlated with the message type so that the accuracy of spam detection can be increased [11]. Spam
and ham messages can be differentiated using various features. Table 2 presents the selected features used to
detect spam.


Table 2. Features Selection used in Existing Projects

Features
Bag-of-words
Post-comment similarity
Inter-comment similarity
Interval between post and comment
Number of words in the comment
Number of sentences in the comment
Comment length
Phone information
E-mail information
URL link
Black word list
Stop words ratio
Presence of symbol
Presence of dots
Presence of emotions
Lower-case words
Uppercase words
Keyword specific
Number of digits
Channel age
The channel average upload

[Author columns, including [1] and [13], with per-feature check marks: not legible]

Based on Table 2, the features most frequently used to detect spam are bag-of-words, post-comment
similarity, number of words in the comment, number of sentences in the comment, comment length,
phone information, URL link and number of digits.


2.3. Classification of Techniques 

Various techniques are used in experiments to evaluate the performance of spam detection. First,
feature selection is performed and the features are extracted. After extraction, classification techniques
such as Decision Trees (DTs) and Naive Bayes are applied to evaluate performance [13].
Classification techniques are used to determine how accurately spam is detected. They are run using
tools such as WEKA and Rapid Miner. Table 3 shows the machine learning techniques used in six papers.


Table 3. Machine Learning Techniques

Author  Title
        Detection Technique

[1]     Tubespam: Comment Spam Filtering on YouTube
        1. Decision trees (CART)
        2. K-nearest neighbors (k-NN)
        3. Logistic regression (LR)
        4. Bernoulli Naive Bayes (NB-B)
        5. Gaussian Naive Bayes (NB-G)
        6. Multinomial Naive Bayes (NB-M)
        7. Random Forest (RF)
        8. Support vector machines with linear kernel (SVM-L)
        9. Support vector machines with polynomial kernel (SVM-P)
        10. Support vector machines with a Gaussian kernel (SVM-R)

[4]     Combating Comment Spam with Machine Learning Approaches
        1. J48 (C4.5 Algorithm)
        2. Random Forest (RFT)
        3. Decision Tree
        4. SVM
        5. Multilayer Neural Network

[13]    Towards Filtering of SMS spam messages using Machine Learning Based Technique
        1. Naive Bayes
        2. Logistic Regression
        3. J48
        4. Decision Table
        5. Random Forest

[9]     Statistical Twitter Spam Detection Demystified-Performance, Stability and Scalability
        1. K-nearest neighbor
        2. Weight K-nearest neighbor
        3. Naive Bayes
        4. Random Forest
        5. C5.0
        6. Boosted Logistic Regression
        7. Stochastic Gradient Boosting Machine
        8. Neural Network

[2]     KidsTube: Detection, Characterization and Analysis of Child Unsafe Content & Promoters on YouTube
        1. Random Forest
        2. K-nearest Neighbor
        3. Decision Tree

[14]    Detecting Video Spammers in YouTube Social Media
        1. Functional Tree
        2. J48
        3. Random Forest
        4. Bayes Network
        5. Naive Bayesian





Based on Table 3, the first paper, on comment spam filtering on YouTube, compared
ten classification algorithms: Decision trees (CART), K-nearest neighbors (k-NN),
Logistic regression (LR), Bernoulli Naive Bayes (NB-B), Gaussian Naive Bayes (NB-G), Multinomial Naive
Bayes (NB-M), Random Forest (RF), Support vector machines with linear kernel (SVM-L), Support vector
machines with polynomial kernel (SVM-P) and Support vector machines with Gaussian kernel (SVM-R).

The second paper, Combating Comment Spam with Machine Learning Approaches,
compared five classification algorithms: J48 (C4.5 Algorithm), Random Forest (RFT),
Decision Tree, SVM and Multilayer Neural Network.

The third paper, Towards Filtering of SMS spam messages using Machine Learning
Based Technique, compared five classification algorithms, namely Naive Bayes, Logistic Regression, J48,
Decision Table and Random Forest.

The fourth paper, Statistical Twitter Spam Detection Demystified-Performance, Stability
and Scalability, compared eight classification algorithms: K-nearest neighbor, Weight K-nearest
neighbor, Naive Bayes, Random Forest, C5.0, Boosted Logistic Regression, Stochastic Gradient Boosting
Machine and Neural Network.

The fifth paper, KidsTube: Detection, Characterization and Analysis of Child Unsafe
Content & Promoters on YouTube, compared three classification algorithms: Random
Forest, K-nearest Neighbor and Decision Tree.

The last paper, Detecting Video Spammers in YouTube Social Media, compared five classification
algorithms to obtain high accuracy, namely Functional Tree, J48, Random Forest, Bayes
Network and Naive Bayesian.

The classification algorithms most frequently used in existing research are Naive Bayesian,
Random Forest, Decision Tree and K-nearest Neighbor.


Table 4. Comparison Table in Detection of Spam 








Author        [1]                   [13]           [9]                   [2]                                [14]
Year          2015                  2017           2017                  2016                               2017
Accuracy      Above 90%             Above 90%      Above 90%             Above 90%                          Above 90%
Algorithm     RF, NB-B              Random Forest  Random Forest, C5.0   Random Forest, K-Nearest Neighbor  Bayes Network, Naive Bayesian
Type of spam  YouTube spam          SMS spam       Twitter spam          YouTube spam                       YouTube spam
Dataset       UCI Machine Learning  Not stated     Not stated            Not stated                         Not stated





Table 4 compares previous research projects on spam detection. Five
research projects are listed to compare the detection accuracy and the algorithms
used. The table also shows that Random Forest is the algorithm that most often achieves
accuracy above 90%. Figure 2 shows how Random Forest works.




Figure 2. Random Forest model 


Random Forest can give accurate results because it builds multiple decision
trees and merges them to obtain a more stable prediction.
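
The idea of building several trees and merging their votes can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the WEKA implementation used later in this work: each "tree" is simplified to a one-level stump, the toy rows and the names train_stump, train_forest and predict are hypothetical, and the two binary features are made up.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, feature_indices):
    """Fit a one-level tree: pick the feature whose presence/absence best splits the labels."""
    overall = majority(labels)
    best = None
    for f in feature_indices:
        present = [y for x, y in zip(rows, labels) if x[f] > 0]
        absent = [y for x, y in zip(rows, labels) if x[f] == 0]
        yes = majority(present) if present else overall
        no = majority(absent) if absent else overall
        correct = sum(y == yes for y in present) + sum(y == no for y in absent)
        if best is None or correct > best[0]:
            best = (correct, f, yes, no)
    return best[1:]  # (feature index, label if present, label if absent)

def train_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train stumps on bootstrap samples, each restricted to a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], features))
    return forest

def predict(forest, row):
    """Merge the individual trees by majority vote."""
    return majority([yes if row[f] > 0 else no for f, yes, no in forest])

# Toy data (made up): columns are [contains_link, contains_word_love].
rows = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
forest = train_forest(rows, labels)
print(predict(forest, [1, 0]), predict(forest, [0, 1]))
```

A single stump trained on an unlucky bootstrap sample can be wrong, but the vote over many stumps is much less sensitive to any one sample, which is the stability referred to above.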


3. RESEARCH METHOD 

This section discusses the methodology used for video sharing spam comment feature detection. It
consists of data collection, tokenization, lemmatization, feature selection and classification modules. Several
experiments are conducted in order to identify the most suitable technique to detect spam comments. The
performance measure used in this research is accuracy.
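
Accuracy is the share of comments whose predicted class matches the true class. The following minimal sketch shows the calculation; the function name accuracy and the five example labels are hypothetical and are not taken from the dataset.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

# Made-up example: one spam comment is wrongly predicted as ham.
true_labels = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "ham"]
print(f"{accuracy(true_labels, predicted):.2f}%")  # 4 of 5 correct -> 80.00%
```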

There are two modules: Module 1: Data Collection and Module 2: Text Mining. Figure 3
shows the framework used in this work.

[Figure 3 content: Framework for Video Spam Comment Feature Selection using Machine Learning. Module 1: Data Collection (video spam comment). Module 2: Text Mining (Tokenization, Lemmatization, Feature Selection Module). Result: detection of video comment ham and spam.]


Figure 3. Framework used in Video Spam Comment Features Selection using Machine Learning Technique 


3.1. Datasets 

The data collected for this work is used for conducting the experiments. To detect spam comments,
a collection of spam and ham comments was selected from the UCI Machine Learning repository.
Because of time constraints on collecting primary data, an existing spam dataset collected by
previous researchers was chosen for this work [1]. It contains 350 real comments extracted from a video,
divided into 175 spam comments and 175 ham comments.
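
Before running experiments it can be useful to confirm the class balance of the loaded data. The sketch below counts spam and ham rows in a small inline CSV. The column names CONTENT and CLASS (with 1 for spam and 0 for ham) are an assumption about the file layout, and the four rows are made up for illustration, not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the column names are an assumed layout.
SAMPLE_CSV = """CONTENT,CLASS
"Check out my channel, free gifts!",1
I love this song,0
Subscribe to me please,1
Best video ever,0
"""

def class_balance(csv_text):
    """Count how many rows are labeled spam (CLASS == 1) and ham (CLASS == 0)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter("spam" if row["CLASS"] == "1" else "ham" for row in reader)

balance = class_balance(SAMPLE_CSV)
print(balance["spam"], balance["ham"])
```

On the dataset described above, the same check would be expected to report 175 spam and 175 ham comments.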


3.2. Tokenization 
The purpose of tokenization is to split the video comments into individual words in order to
smooth the lemmatization process. For this work, tokenization was done using Microsoft Excel.


3.3. Lemmatization 

Lemmatization is the process of grouping similar words. In this work, the process is instead used
to group words that are exactly the same, because in most cases the video comment attacker simply uses
different abbreviations of words. This process is done using Rapid Miner.
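
The two steps above, splitting comments into words and grouping words that are exactly the same, can be sketched in Python as follows. This is only an illustration, since the work itself used Microsoft Excel and Rapid Miner; the regular expression, the function names tokenize and group_identical_words, and the three example comments are assumptions.

```python
import re
from collections import Counter

def tokenize(comment):
    """Split a comment into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", comment.lower())

def group_identical_words(comments):
    """Count how often each exact token occurs across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    return counts

# Made-up example comments.
comments = [
    "Check out my channel!",
    "Love this song, love it",
    "check my channel please",
]
word_counts = group_identical_words(comments)
print(word_counts.most_common(4))
```

Because grouping is by exact match, an abbreviation such as "chnl" would be counted separately from "channel", which is the behaviour described for this work.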


3.4. Feature Selection 

Feature selection is important for spam comment detection because the accuracy of detecting
spam comments depends on the features that have been selected [13]. In this experiment, only datasets that
contain the text of comments are used. The features extracted and evaluated in this work are based on
the bag-of-words model. The features were selected based on the comparison presented in the related work.
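
A bag-of-words representation turns each comment into a vector of word counts over a fixed vocabulary. The sketch below keeps the k most frequent words as features. Ranking words by frequency is an assumption made for illustration, as the text does not state how the words were ordered, and the names select_vocabulary and to_feature_vector are hypothetical.

```python
from collections import Counter

def select_vocabulary(tokenized_comments, k):
    """Pick the k most frequent words as features (frequency ranking is an assumption)."""
    counts = Counter(word for tokens in tokenized_comments for word in tokens)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

def to_feature_vector(tokens, vocabulary):
    """Count occurrences of each vocabulary word in one comment."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Made-up tokenized comments.
tokenized = [
    ["check", "out", "my", "channel"],
    ["love", "this", "song", "love", "it"],
    ["check", "my", "channel", "please"],
]
vocabulary = select_vocabulary(tokenized, k=4)
vectors = [to_feature_vector(tokens, vocabulary) for tokens in tokenized]
print(vocabulary)
print(vectors)
```

Increasing k corresponds to using a larger word range as features, at the cost of longer vectors.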


3.5. Classification 

After the features are extracted, classification is tested using the WEKA tool. Six machine learning
algorithms are used in this experiment: Random Tree, Random Forest, Naive Bayes, KStar, Decision
Table and Decision Stump. Table 5 shows the classification algorithms used in this experiment. The accuracy
rate is used to compare the algorithms' performance.


Table 5. Classification Algorithms used in Work

Classification  Technique
RT              Random Tree
RF              Random Forest
NB              Naive Bayes
K*              KStar
DTs             Decision Tree
DS              Decision Stump


4. RESULTS AND ANALYSIS

This section presents the results of the research. Experiments were performed to evaluate the
performance of the proposed spam comment detection. The first step is to select features based on the basic behavior of
spam and ham comments and then extract the features from the dataset to obtain feature vectors. After the extraction
of features, various classification algorithms such as Random Tree, Random Forest, Naive Bayes, KStar,
Decision Table and Decision Stump are applied to measure accuracy. Table 6 shows the results of
the proposed approach with various machine learning algorithms.


Table 6. Results of Proposed Approach on Various Machine Learning Algorithms 








Feature Selection (words)   Accuracy (%)
                            RT      RF      NB      K*      DTs     DS
1-39                        82.00   87.14   82.57   82.86   83.71   58.86
1-78                        84.57   89.14   83.43   84.86   68.29   63.43
1-117                       85.43   90.29   81.74   85.14   68.29   63.44
1-156                       86.20   90.00   83.71   85.14   76.86   65.71
1-195                       86.86   90.57   84.00   84.58   76.86   65.71





Based on Table 6, the highest accuracy using Random Tree classification is 86.86% with 195
words, while the lowest is 82.00% with 39 words.

For the Random Forest classification algorithm, the highest accuracy is 90.57% with 195
words and the lowest is 87.14% with 39 words.

The highest accuracy for Naive Bayes is 84.00% with 195 words and the lowest is 81.74% with
117 words.

With KStar classification, the highest accuracy is 85.14%, obtained with both 117 and 156 words.
The lowest accuracy with KStar classification is 82.86% with 39 words.

For the Decision Tree classification algorithm, the highest accuracy is 83.71% with 39 words and
the lowest is 68.29% with both 78 and 117 words.

Lastly, for Decision Stump classification, the highest accuracy is 65.71% with both 156 and 195
words. The lowest accuracy is 58.86% with 39 words.
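
The per-classifier highs and lows above can be checked directly against Table 6. The short script below copies the accuracy values from the table and reports, for each classifier, the highest and lowest accuracy and the word ranges at which they occur. The variable and function names are illustrative only.

```python
# Accuracy (%) per classifier, copied from Table 6, in order of the feature ranges.
feature_ranges = ["1-39", "1-78", "1-117", "1-156", "1-195"]
table6 = {
    "RT":  [82.00, 84.57, 85.43, 86.20, 86.86],
    "RF":  [87.14, 89.14, 90.29, 90.00, 90.57],
    "NB":  [82.57, 83.43, 81.74, 83.71, 84.00],
    "K*":  [82.86, 84.86, 85.14, 85.14, 84.58],
    "DTs": [83.71, 68.29, 68.29, 76.86, 76.86],
    "DS":  [58.86, 63.43, 63.44, 65.71, 65.71],
}

def extremes(scores):
    """Return (highest, ranges with highest, lowest, ranges with lowest)."""
    hi, lo = max(scores), min(scores)
    hi_ranges = [r for r, s in zip(feature_ranges, scores) if s == hi]
    lo_ranges = [r for r, s in zip(feature_ranges, scores) if s == lo]
    return hi, hi_ranges, lo, lo_ranges

summary = {name: extremes(scores) for name, scores in table6.items()}
best_classifier = max(table6, key=lambda name: max(table6[name]))

for name, (hi, hi_r, lo, lo_r) in summary.items():
    print(f"{name}: highest {hi:.2f} at {hi_r}, lowest {lo:.2f} at {lo_r}")
print("Best overall:", best_classifier)
```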


5. CONCLUSION 

After comparing the performance of various machine learning algorithms, Random Forest
classification gives the highest accuracy, 90.57%, for 1 to 195 words of feature selection.
The lowest accuracy comes from Decision Stump classification, 58.86%, for 1 to 39 words of feature
selection. Random Forest classification therefore achieved the best classification results with
high accuracy.

This work proposed a technique for detecting spam comments on video sharing platforms to overcome the
problems faced by social media users. A total of 195 words were used as selected features
in six machine learning algorithms, namely Random Tree, Random Forest, Naive Bayes, KStar,
Decision Table and Decision Stump, to find the highest spam detection accuracy. Of all the classification
algorithms, Random Forest classification gives the best result with 90.57% accuracy.